Allelotyping Methods for Massively Parallel Sequencing

ABSTRACT

In one illustrative embodiment, an allelotyping method may include selecting a plurality of text strings that each represent a nucleotide sequence that was read by a massively parallel sequencing (MPS) instrument, where the nucleotide sequences represented by the selected plurality of text strings each correspond to a particular locus, comparing the selected plurality of text strings to one another to determine an abundance count for each unique text string included in the selected plurality of text strings, and determining one or more alleles for the particular locus by comparing the abundance count for each unique text string included in the selected plurality of text strings to an abundance threshold.

CROSS-REFERENCE TO RELATED APPLICATION

The instant application is a continuation of U.S. patent application Ser. No. 13/952,761, now U.S. Pat. No. 11,468,970, which was filed Jul. 29, 2013, and is hereby incorporated by reference in its entirety. The instant application also claims priority to pending U.S. patent application Ser. No. 14/489,198, which was filed on Sep. 17, 2014, and is hereby incorporated by reference in its entirety.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Oct. 11, 2022, is named 920006-373507_SL.txt and is 3,550 bytes in size.

TECHNICAL FIELD

The present disclosure relates, generally, to allelotyping methods and, more particularly, to allelotyping methods for nucleotide sequence data obtained using massively parallel sequencing (MPS).

BACKGROUND

Polymorphic tandem repeats of nucleotide sequences are found throughout the human genome, and the particular combinations of alleles identified by their multiple repeat sites are sufficiently unique to an individual that these repeating sequences can be used in human or other organism identification. These markers are also useful in genetic mapping and linkage analysis, where the tandem repeat sites may be useful for determining, for example, predisposition for disease. Tandem repeats can be used directly in human identity testing, such as in forensics analysis. There are many types of tandem repeats of nucleic acids, falling under the general term variable number tandem repeats (VNTR). For example, minisatellites and microsatellites are VNTRs, and microsatellites include short tandem repeats (STRs).

One application of tandem repeat analysis is in forensics or human identity testing. In current forensics analyses, highly polymorphic STRs are identified using a deoxyribonucleic acid (DNA) sample from an individual and DNA amplification steps, such as polymerase chain reaction (PCR), to provide amplified samples of partial DNA sequences, or amplicons, from the individual's DNA. The amplicons can then be matched by size (i.e., repeat numbers) to reference databases, such as the sequences stored in national or local DNA databases. For example, amplicons that originate from STR loci can be matched to reference STR databases, including the Federal Bureau of Investigation (FBI) Combined DNA Index System (CODIS) database in the United States, or the National DNA Database (NDNAD) in the United Kingdom, to identify the individual by matching to the STR alleles specific to that individual.

Forensic DNA analysis is about to cross a threshold where DNA samples will begin to be analyzed routinely by massively parallel sequencing (MPS), also sometimes referred to in the art as next-generation sequencing. The advent of routine MPS for forensic DNA analysis will create large quantities of nucleotide sequence data that may enable richer exploitation of DNA in forensic applications. Once information is generated on the genetic profile of an individual (e.g., for either forensic investigative purposes or confirmatory matching), the resulting nucleotide sequence data should be formatted for exchange among law enforcement entities. Moreover, forensic analysis requires the preservation of data, including raw data, for evidentiary purposes. Data files created by MPS workflows are typically larger than 1 GB, making them difficult to transmit or store. In addition, these files, while text-based, are not human-readable in any practical sense because of their large size. Thus, other human-readable forms of the data from MPS workflows are needed.

SUMMARY

According to one aspect, an allelotyping method may comprise selecting a plurality of text strings that each represent a nucleotide sequence that was read by a massively parallel sequencing (MPS) instrument, where the nucleotide sequences represented by the selected plurality of text strings each correspond to a particular locus, comparing the selected plurality of text strings to one another to determine an abundance count for each unique text string included in the selected plurality of text strings, and determining one or more alleles for the particular locus by comparing the abundance count for each unique text string included in the selected plurality of text strings to an abundance threshold.

In some embodiments, determining the one or more alleles for the particular locus may comprise identifying one or more unique text strings that each represent a nucleotide sequence containing a short tandem repeat (STR). In other embodiments, determining the one or more alleles for the particular locus may comprise identifying one or more unique text strings that each represent a nucleotide sequence containing a single nucleotide polymorphism (SNP).

In some embodiments, comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold may comprise identifying the unique text string having a highest abundance count from among the selected plurality of text strings. Comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold may further comprise calculating whether a ratio of the abundance count for each unique text string compared to the highest abundance count exceeds the abundance threshold. The abundance threshold may be a percentage value in the range of 15% to 60%. The abundance threshold may be a percentage value that is configurable by a user.

In some embodiments, the allelotyping method may further comprise receiving a first text-based computer file comprising a plurality of text strings that each represent a nucleotide sequence that was read by the MPS instrument, prior to selecting the plurality of text strings for which the represented nucleotide sequences each correspond to the particular locus, and generating a second text-based computer file comprising each unique text string for which the ratio exceeds the abundance threshold, where a file size of the second text-based computer file is smaller than a file size of the first text-based computer file. The second text-based computer file may comprise one or more unique text strings that each represent a nucleotide sequence determined to be an allele for the particular locus. The file size of the second text-based computer file may be at least ten-thousand times smaller than the file size of the first text-based computer file.

In some embodiments, the steps of (i) selecting the plurality of text strings for which the represented nucleotide sequences each correspond to a particular locus, (ii) comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings, and (iii) determining one or more alleles for the particular locus by comparing the abundance counts to the abundance threshold may be performed for each of a plurality of loci present in a sample that was read by the MPS instrument. Selecting the plurality of text strings for which the represented nucleotide sequences each correspond to one of the plurality of loci may comprise determining whether each of a plurality of text strings generated by the MPS instrument when reading the sample includes text characters that represent a flanking nucleotide sequence associated with a particular locus and selecting a plurality of text strings that include the text characters that represent the flanking nucleotide sequence associated with the particular locus.

In some embodiments, the allelotyping method may further comprise removing the text characters that represent the flanking nucleotide sequence from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings. The allelotyping method may further comprise removing all text characters that do not represent a short tandem repeat (STR) from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings.

According to another aspect, a computer-readable medium may store a text-based computer file comprising one or more records. Each of the one or more records may include a first text line, a second text line comprising a text string representing a nucleotide sequence containing a single nucleotide polymorphism (SNP), a third text line comprising a human-readable allele designation for the SNP, and a fourth text line.

In some embodiments, the human-readable allele designation of the third text line may comprise a number of attribute-value pairs that specify a first SNP state, a second SNP state, an abundance count of the first SNP state, an abundance count of the second SNP state, and a strand of the nucleotide sequence represented in the second text line. The human-readable allele designation of the third text line may further comprise an attribute-value pair specifying a reference SNP identifier associated with the nucleotide sequence represented in the second text line.

In some embodiments, the first text line may comprise a unique sequence identifier created by a massively parallel sequencing (MPS) instrument when generating data related to the nucleotide sequence represented in the second text line. The first text line may further comprise forensic metadata specifying one or more of a file format, a unique case identifier, a unique sample identifier, a unique laboratory identifier, and a unique technician identifier. The fourth text line may comprise a text string representing quality scores associated with the nucleotide sequence represented in the second text line.

BRIEF DESCRIPTION OF THE DRAWINGS

The concepts described in the present disclosure are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. The detailed description particularly refers to the accompanying figures in which:

FIG. 1 is a simplified flow diagram illustrating one embodiment of an allelotyping method;

FIG. 2 shows portions of an illustrative table including unique text strings that each represent a nucleotide sequence (SEQ ID NOS 1-20, respectively, in order of appearance) corresponding to a particular locus, in which the unique text strings have been sorted by abundance count using the allelotyping method of FIG. 1;

FIG. 3 is a graph of the abundance counts of the top ten unique text strings sorted by abundance from FIG. 2;

FIG. 4 is a simplified flow diagram illustrating one embodiment of a method of creating a text-based computer file in the Sequence Evidence Format (SEF); and

FIG. 5 illustrates a portion of one embodiment of an SEF file comprising human-readable allele designations for several single nucleotide polymorphisms (SNPs) (SEQ ID NOS 21-23, respectively, in order of appearance).

DETAILED DESCRIPTION OF THE DRAWINGS

While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the figures and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.

References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory computer-readable storage medium, which may be read and executed by one or more processors. A computer-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a computing device (e.g., a volatile or non-volatile memory, a media disc, or other media device).

In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.

The present disclosure relates to methods of allelotyping loci using an approach based on matching and sorting of the reads generated by an MPS instrument when analyzing a sample. STR loci exist in multiple states, called alleles. STR alleles always differ from one another by nucleotide sequence. In addition, alleles often differ from one another by sequence length. Allelotyping is the process of identifying the one or more alleles present at a particular STR locus in a given sample (by way of example, for a sample from one human, the one allele where the locus is homozygous or the two alleles where the locus is heterozygous). The FBI and other law enforcement agencies around the world have selected about twenty-four specific STR loci for use in forensic DNA analysis applications. These STR loci have been selected because they are highly polymorphic, i.e., multiple alleles exist within the relevant populations. For example, the thirteen STR loci considered by the FBI as their “core” loci each exhibit eight to ten different common allelic forms.

A typical MPS process starts with a DNA sample, which is typically amplified using PCR (however, it will be appreciated that certain DNA samples can also be directly sequenced by MPS instruments without pre-amplification). The output of the MPS process typically takes the form of one or more computer files containing text strings representing the nucleotide sequences that were read by the MPS instrument. The parallel nature of MPS results in numerous replicates of the nucleotide sequence of each allele. This is particularly true when the DNA is pre-amplified by PCR, where the number of replicates of each allelic sequence may number in the tens of thousands. However, both the PCR pre-amplification process and the sequencing process itself have non-trivial error rates. Errors are realized as incorrect nucleotide sequences in many of the copies of the original DNA sequences, i.e., the “true” alleles that existed in the original DNA sample. These sequence errors may act to obscure the nucleotide sequences of the “true” alleles.

The diversity of unique nucleotide sequences generated by a PCR-MPS workflow can be illustrated using an experiment performed on a DNA sample from an anonymous human subject. This anonymous human subject was known to contain allele numbers 9 and 10 (with known sequences [AGAT]₉ (SEQ ID NO: 1) and [AGAT]₁₀ (SEQ ID NO: 2), respectively) at the CSF1PO locus. The MPS instrument returned approximately 20,000 replicate sequences for the two alleles at this locus (i.e., an average of 10,000 replicate sequences per allele). In the absence of error, only these two “true” allele nucleotide sequences would be returned in the data generated by the MPS instrument. However, because of error in the PCR amplification and MPS processes, a total of 598 unique nucleotide sequences were observed. Two correspond to the “true” alleles, and the remaining 596 correspond to method artifacts and sequencing errors. FIG. 2 shows twenty of the 598 unique nucleotide sequences output by the MPS instrument (the top ten most abundant nucleotide sequences, and ten “singleton” nucleotide sequences that each occurred only once), while FIG. 3 shows a graph of the number of reads by the MPS instrument for the top ten most abundant nucleotide sequences. The two most abundant unique nucleotide sequences correspond to the “true” alleles (i.e., allele numbers 9 and 10). The next most abundant unique sequences correspond to PCR stutter artifacts (i.e., the insertion or deletion of repeat motifs in nucleotide sequences due to strand slippage in the PCR process). The less abundant sequences exhibit a wide range of mutations, including sequence truncations and base substitution errors.

Referring now to FIG. 1, one illustrative embodiment of an allelotyping method 100 that may be used to determine the one or more “true” alleles for each locus of a sample analyzed by an MPS instrument is shown as a simplified flow diagram. In some embodiments, the allelotyping method 100 may be implemented as one or more software routines executed by one or more processors. By way of example, it is contemplated that the allelotyping method 100 may be performed by a processor of the MPS instrument analyzing the sample, in some embodiments, or by a processor of a separate computing device that receives nucleotide sequence data output by the MPS instrument (for instance, in the form of a FASTA or FASTQ file), in other embodiments. The allelotyping method 100 is illustrated in FIG. 1 as a number of blocks 102-114, which will be described in detail below. It will be appreciated that, although the allelotyping method 100 is generally shown in FIG. 1 and described below as having a linear workflow, the allelotyping method 100 may be implemented using any number of workflows (including, by way of example, workflows in which various portions of the allelotyping method 100 are performed in parallel).

The allelotyping method 100 begins with block 102 in which a number of text strings that each represent a nucleotide sequence that was read by an MPS instrument are received. In these text strings, each character represents one base of a nucleotide sequence read by the MPS instrument. As noted above, due to the parallel nature of the MPS process, one sample may result in tens or hundreds of thousands of reads and, hence, associated text strings. Many MPS instruments output this data in files that follow either the FASTA or FASTQ format. As such, in some embodiments of the allelotyping method 100, block 102 may involve receiving such a FASTA or FASTQ file and reading a number of text strings that each represent a nucleotide sequence from that file. In other embodiments, block 102 may involve receiving files in different formats (other than FASTA or FASTQ).
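By way of a non-limiting sketch, block 102 might be approximated in Python as shown below. The function name read_fastq_sequences is hypothetical, and the sketch assumes unwrapped, four-line FASTQ records (identifier, sequence, separator, quality); it is one possible reading routine and not a required implementation.

```python
import io
from typing import Iterator, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence text string (second line) of each FASTQ record.

    Assumes every record occupies exactly four lines; wrapped records
    would need a more careful parser.
    """
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()         # '+' separator line (ignored)
        handle.readline()         # quality line (ignored in this sketch)
        yield sequence.upper()


# Illustrative two-record input; the reads are invented for the example.
example = io.StringIO(
    "@read1\nAGATAGAT\n+\nIIIIIIII\n"
    "@read2\nAGATAGAC\n+\nIIIIIIII\n"
)
reads = list(read_fastq_sequences(example))
```

In this sketch each yielded string holds one character per base, matching the text strings described for block 102.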

After block 102, the allelotyping method 100 proceeds to block 104 in which a bioinformatic matching procedure is used to determine which reads from the MPS instrument (represented by the received text strings) correspond to one or more particular loci present in the sample. In the illustrative embodiment of block 104, each of the received text strings is evaluated to determine whether it includes characters that represent a known flanking nucleotide sequence associated with a particular locus. The nucleotide sequence of each STR locus is generally flanked, on both the 5′ and 3′ ends, by particular flanking nucleotide sequences. These complex nucleotide sequences can be rare or unique in the DNA analyzed by the MPS instrument. As such, the presence of one (or both) of these flanking nucleotide sequences may indicate the presence of an STR locus in the nucleotide sequence represented by the text string. It is contemplated that, in other embodiments, the text strings may be searched for other specific nucleotide sequences (e.g., the 5′ primer sequences used to PCR amplify each STR locus).

The flanking nucleotide sequences, or other specific nucleotide sequences, used in block 104 may be any length. Longer sequences are more likely to be unique, but increasing length should be balanced against the slower bioinformatic processing inherent to long-sequence matching routines. Moreover, the longer a sequence, the more likely that reads which should contain the sequence will contain an error, reducing the sequence's value as an identifier. In some embodiments, block 104 may reference an external, updatable library of flanking nucleotide sequences (or other specific nucleotide sequences). The text string matching routine used in block 104 can be stringent (e.g., exact character matching) or flexible (e.g., close, overall matching). Block 104 may use any number of bioinformatic approaches, including, but not limited to, string matching routines built into common programming languages (e.g., Java, C++, Perl or Python), regular expression routines built into some programming languages (e.g., Perl) or available as modules for other programming languages, or the like. In some embodiments of block 104, each text string determined to correspond to a particular locus may be grouped or segregated by locus. In other embodiments, each text string determined to correspond to a particular locus may be metadata tagged with that particular locus. In some embodiments, text strings that are not identified as corresponding to any locus (e.g., where the text string does not contain a known flanking nucleotide sequence) may be discarded.
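One way the stringent (exact character matching) variant of block 104 might be sketched in Python is shown below. The locus names and flanking sequences in FLANK_LIBRARY are invented placeholders rather than real flanks for any forensic locus, the function name assign_reads_to_loci is hypothetical, and the sketch requires both flanks to be present, which is only one of the options contemplated above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical, updatable flank library: locus -> (5' flank, 3' flank).
# These sequences are placeholders chosen only for illustration.
FLANK_LIBRARY: Dict[str, Tuple[str, str]] = {
    "LOCUS_A": ("ACGTTGCA", "TTGACCGA"),
    "LOCUS_B": ("GGCATTCA", "CCTAGGAT"),
}


def assign_reads_to_loci(
    reads: Iterable[str],
    library: Dict[str, Tuple[str, str]] = FLANK_LIBRARY,
) -> Dict[str, List[str]]:
    """Group reads by locus using exact matching of both flanks.

    Reads that match no locus in the library are discarded.
    """
    by_locus: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for locus, (five_prime, three_prime) in library.items():
            start = read.find(five_prime)
            if start == -1:
                continue
            # The 3' flank must occur after the end of the 5' flank.
            if read.find(three_prime, start + len(five_prime)) != -1:
                by_locus[locus].append(read)
                break
    return dict(by_locus)


example_reads = [
    "TTACGTTGCAAGATAGATTTGACCGAGG",  # contains both LOCUS_A flanks
    "GGCATTCAGATAGATACCTAGGAT",      # contains both LOCUS_B flanks
    "AAAAAAAAAAAA",                  # no known flank; discarded
]
grouped = assign_reads_to_loci(example_reads)
```

Grouping into a per-locus list corresponds to the segregation embodiment; a metadata-tagging embodiment could instead return (locus, read) pairs.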

After block 104, the allelotyping method 100 proceeds to block 106 in which the text strings representing nucleotide sequences associated with a particular locus are selected for further processing. In other words, blocks 106-114 of the allelotyping method 100 are performed on a locus-by-locus basis. As indicated in the illustrative embodiment of FIG. 1, blocks 106-114 may be repeated for each locus identified as present in the sample in block 104. In some embodiments, block 106 may involve selecting a locus-specific list containing all of the text strings determined in block 104 to contain characters that represent a known flanking nucleotide sequence associated with a particular locus. In other embodiments, block 106 may involve selecting all text strings assigned a particular metadata tag in block 104.

After block 106, the allelotyping method 100 proceeds to block 108 in which each of the text strings selected in block 106 is trimmed to remove characters that do not represent the nucleotide sequence of interest. For instance, in some embodiments, all characters except those representing an STR may be removed from each of the selected text strings. In other embodiments, all characters except those representing a nucleotide sequence containing a SNP may be removed from each of the selected text strings. In the illustrative embodiment of block 108, the characters representing the 5′ and 3′ flanking nucleotide sequences (used to identify the text string in block 104), as well as any characters outside of the characters representing the 5′ and 3′ flanking nucleotide sequences, are removed from each of the selected text strings. After block 108, the text strings contain only the alleles proper, but remain identified with a particular locus (due to the grouping, metadata tagging, or other segregation procedure performed in block 104). It is contemplated that, in some embodiments of the allelotyping method 100, block 108 may be omitted, keeping the characters that do not represent the nucleotide sequence of interest but ignoring those characters in blocks 110-114.
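The trimming of block 108 might be sketched in Python as follows. The function name trim_to_allele and the flank sequences are hypothetical, and the sketch assumes exact matches to both flanks; it takes the first occurrence of the 3′ flank after the 5′ flank, which is a simplifying assumption rather than a stated requirement.

```python
from typing import Optional


def trim_to_allele(read: str, five_prime: str, three_prime: str) -> Optional[str]:
    """Return only the characters between the 5' and 3' flanks.

    The flanks themselves, and any characters outside them, are removed.
    Returns None when either flank is absent from the read.
    """
    start = read.find(five_prime)
    if start == -1:
        return None
    allele_start = start + len(five_prime)
    end = read.find(three_prime, allele_start)
    if end == -1:
        return None
    return read[allele_start:end]


# Placeholder flanks around a two-repeat AGAT region, for illustration only.
trimmed = trim_to_allele("TTACGTTGCAAGATAGATTTGACCGAGG", "ACGTTGCA", "TTGACCGA")
# trimmed == "AGATAGAT"
```

After trimming, reads that differ only in how far they extend past the flanks collapse to the same text string, which is what makes the subsequent exact-match counting meaningful.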

After block 108, the allelotyping method 100 proceeds to block 110 in which each unique text string included in the selected text strings (associated with a particular locus) is grouped and counted. In other words, block 110 determines an abundance count for each nucleotide sequence represented by the selected text strings. In the illustrative embodiment, block 110 involves comparing the selected text strings to one another to determine each unique text string and its abundance in the group of selected text strings. Block 110 may be performed by applying a stringent text string matching routine (e.g., exact character matching) to the selected text strings. As noted above, it is contemplated that string matching routines built into common programming languages (e.g., Java, C++, Perl or Python), regular expression routines built into some programming languages (e.g., Perl) or available as modules for other programming languages, or the like, may be used for this purpose. Block 110 may also involve sorting the unique text strings, each representing a unique nucleotide sequence read by the MPS instrument, into a list according to their abundance counts. FIG. 2 illustrates top and bottom portions of such a list, in which the 598 unique nucleotide sequences output by an MPS instrument and determined to be associated with the CSF1PO locus were sorted by abundance count.

It will be appreciated that the exact-matching scheme in block 110 will result in an abundance count (i.e., a number of occurrences) for each even slightly different nucleotide sequence. For instance, where a text string represents a nucleotide sequence containing a SNP, the allelotyping method 100 will independently count this unique text string (and will not group it with a different text string representing a similar nucleotide sequence not exhibiting the SNP). This approach is quite different from prior alignment-based allelotyping methods, which attempt to identify which reference sequence each portion of nucleotide sequence data most resembles.
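The exact-match grouping, counting, and sorting of block 110 might be sketched in Python using the standard library collections.Counter, as shown below. The function name count_unique_strings is hypothetical and the read counts are invented for illustration; they are not the data of FIG. 2.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_unique_strings(trimmed_reads: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (unique text string, abundance count) pairs, most abundant first.

    Counter uses exact string equality, so any single-character difference
    produces a separate entry with its own count.
    """
    return Counter(trimmed_reads).most_common()


# Invented example: two abundant strings, one shorter variant, and one
# singleton with a single base substitution in the final repeat.
example = (
    ["AGAT" * 9] * 5
    + ["AGAT" * 10] * 4
    + ["AGAT" * 8] * 2
    + ["AGAT" * 8 + "AGAC"]
)
ranked = count_unique_strings(example)
```

Because the substituted string is counted on its own rather than merged with its nearest neighbor, the sketch reflects the distinction drawn above between exact matching and alignment-based approaches.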

After block 110, the allelotyping method 100 proceeds to block 112 in which the abundance count for each unique text string determined in block 110 is compared to one or more abundance thresholds. As noted, while the MPS workflow results in many unique nucleotide sequences for a particular locus, one or two of these nucleotide sequences will be the “true” alleles for that particular locus (in the case of a sample from one human). As explained further below, comparison of each of the abundance counts determined in block 110 to an abundance threshold may allow determination of the one or more alleles for that particular locus. In the illustrative embodiment, block 112 involves identifying the unique text string having the highest abundance count from among the selected text strings. The nucleotide sequence represented by this text string will be one allele for the associated locus, because despite the non-trivial error rates, PCR and MPS workflows exhibit sufficient fidelity to generate correct sequences in abundance. The illustrative embodiment of block 112 also involves calculating a ratio of the abundance count for each unique text string as compared to that highest abundance count. It may then be determined whether the ratio calculated for each unique text string exceeds the abundance threshold(s). In some embodiments of block 112, each abundance threshold may be a predetermined percentage value. For instance, the abundance threshold may be a percentage value in the range of 15% to 60%, as explained further below. In other embodiments, the abundance threshold may be a percentage value that is configurable by a user of the allelotyping method 100.

After block 112, the allelotyping method 100 proceeds to block 114 in which the one or more alleles for the particular locus being considered are determined based on the comparisons performed in block 112. By way of example, where the sample analyzed by the MPS instrument is known to contain DNA from a single human source, the abundance threshold used in block 112 may be set to a percentage value in the range of about 50% to about 60%. It is known that, for sister alleles at a heterozygous locus, the lesser abundant allele will typically have a ratio of about 50-60% or greater to the more abundant allele (i.e., the allele represented by the text string with the highest abundance count). As such, if a second text string is determined in block 112 to have an abundance count exceeding about 50-60% of the highest abundance count, block 114 may conclude that the locus is heterozygous and that the second text string represents the lesser abundant sister allele. Alternatively, if no text string is determined in block 112 to have an abundance count exceeding about 50-60% of the highest abundance count, block 114 may conclude that the locus is homozygous and that the text string with the highest abundance count represents the only allele at that particular locus. As another example, in cases where it is unknown if the sample analyzed by the MPS instrument contained DNA from multiple sources, the abundance threshold used in block 112 may be set to a percentage value of about 15% (or higher). This abundance threshold may capture alleles from a secondary source that is present in a lower amount in the sample, while avoiding detection of artifacts caused by PCR stutter (which tend to be about 4-15% of the corresponding allele abundance count). In yet another embodiment, the abundance threshold might be set to a percentage value of about 4% to capture PCR stutter artifacts but avoid background noise (which tends to be <4% of the highest abundance count).
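The ratio comparison of block 112 and the allele determination of block 114 might be sketched in Python as follows. The function name call_alleles is hypothetical, the threshold is expressed as a fraction (0.5 for 50%), and the abundance counts are invented for illustration rather than taken from FIG. 2.

```python
from typing import Dict, List


def call_alleles(abundance_counts: Dict[str, int], threshold: float = 0.5) -> List[str]:
    """Return unique text strings whose ratio to the highest count exceeds threshold.

    The most abundant string always qualifies (its ratio is 1.0) for any
    threshold below 1.0. Results are ordered from most to least abundant.
    """
    if not abundance_counts:
        return []
    highest = max(abundance_counts.values())
    ranked = sorted(abundance_counts.items(), key=lambda item: item[1], reverse=True)
    return [seq for seq, count in ranked if count / highest > threshold]


# Invented counts: two sister alleles, one stutter-like product at 10% of
# the highest count, and one singleton error.
counts = {
    "AGAT" * 10: 9000,
    "AGAT" * 9: 8000,
    "AGAT" * 8: 900,
    "AGAT" * 8 + "AGAC": 1,
}

single_source = call_alleles(counts, threshold=0.50)  # two alleles: heterozygous
mixture_screen = call_alleles(counts, threshold=0.15)  # stutter still excluded
with_stutter = call_alleles(counts, threshold=0.04)    # stutter now included
```

With a 50% threshold, a result containing one string would be read as homozygous and a result containing two strings as heterozygous, consistent with the single-source example above.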

Referring now to FIG. 4, one illustrative embodiment of a method 400 of creating a text-based computer file 410 in a Sequence Evidence Format (SEF) is shown as a simplified flow diagram. Certain illustrative embodiments of the SEF are described in detail in U.S. patent application Ser. No. 13/834,830, filed Mar. 15, 2013, and entitled “Computer Files and Methods Supporting Forensic Analysis of Nucleotides Sequence Data,” the entire disclosure of which is incorporated herein by reference (the foregoing application generally referring to the SEF as the “FASTF file format”). Additional illustrative embodiments of the SEF are further described below (with reference to FIG. 5).

The method 400 is intended for use in the forensic DNA analysis process to create an SEF file 410 that may serve as an evidentiary record. As shown in FIG. 4, an SEF file creator 408 may ingest data from at least two different sources. Raw nucleotide sequence data, quality scores associated with the nucleotide sequence data, and instrument-specific metadata may be ingested from a FASTQ file 404 generated by an MPS instrument 402 (and its associated software). Each record of the FASTQ file 404 includes a text string that represents a nucleotide sequence that was read by the MPS instrument 402, with each of the characters of that text string representing an output of a base call algorithm performed by the MPS instrument 402. The SEF file creator 408 may also ingest a case file 406 received from a laboratory case management system. In other embodiments of the method 400, the case management information may be manually input by a user of the MPS instrument 402 via a user interface. Using this input, the SEF file creator 408 may write forensic metadata to each record of the SEF file 410.

The SEF file creator 408 may perform processing on the raw nucleotide sequence data ingested from the FASTQ file 404 when generating the SEF file 410. In the illustrative embodiment of the method 400, the SEF file creator 408 performs the allelotyping method 100 on the nucleotide sequence data in the FASTQ file 404. In this embodiment, block 102 of the allelotyping method 100 involves receiving the FASTQ file 404, which contains a number of text strings that each represent a nucleotide sequence read by the MPS instrument 402. Upon completion of the allelotyping method 100, the SEF file creator 408 may write to the SEF file 410 each unique text string determined in block 112 of the allelotyping method 100 to exceed the abundance threshold. In some embodiments, the SEF file creator 408 may write to the SEF file 410 each unique text string determined in block 114 of the allelotyping method 100 to represent an allele for a particular locus.

It will be appreciated that, as a result of the allelotyping method 100, the SEF file 410 generated by the SEF file creator 408 may have a file size that is significantly smaller than that of the FASTQ file 404. While the FASTQ file 404 contains an individual, four-line record for every read performed by the MPS instrument 402, the SEF file 410 contains only one four-line record for each “true” allele in the sample (and any other nucleotide sequences of interest). As such, using the allelotyping method 100, the SEF file creator 408 is able to “compress” the FASTQ file 404 into the SEF file 410 while retaining all of the information important for forensic analysis. Table 1 below sets forth five examples of FASTQ files 404 that were processed by the SEF file creator 408 to generate SEF files 410. As can be seen from these five examples, the allelotyping method 100 can produce an SEF file 410 with a file size that is at least 10,000 times smaller than the file size of the source FASTQ file 404.

TABLE 1

FASTQ File    SEF File     Flank Size       Min. Stack    Processing
Size (kB)     Size (kB)    (nucleotides)    Coverage      Time (Min.)
1,723,996     20           10.10            4             2:05
697,339       18           10.10            4             0:45
697,339       12           10.10            50            0:39
1,794,807     13           10.10            50            2:04
1,414,563     11           10.10            50            1:37
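The size reduction reflected in Table 1 can be checked with simple arithmetic. The short Python sketch below, offered only as an illustration, divides each FASTQ file size by the corresponding SEF file size; the function name `compression_ratios` is hypothetical, and the file sizes are those listed in Table 1.

```python
def compression_ratios(rows):
    """Return FASTQ size divided by SEF size for each (fastq_kb, sef_kb) pair."""
    return [fastq_kb / sef_kb for fastq_kb, sef_kb in rows]


# (FASTQ file size in kB, SEF file size in kB), as listed in Table 1
TABLE_1 = [
    (1_723_996, 20),
    (697_339, 18),
    (697_339, 12),
    (1_794_807, 13),
    (1_414_563, 11),
]

ratios = compression_ratios(TABLE_1)
for (fastq_kb, sef_kb), ratio in zip(TABLE_1, ratios):
    print(f"{fastq_kb:>9,} kB -> {sef_kb:>2} kB  (about {ratio:,.0f}x smaller)")
```

The smallest ratio among the five examples is 697,339 / 18, or roughly 38,741, so each example exceeds the 10,000-fold reduction noted above.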

Several records of one illustrative embodiment of an SEF file 410 are illustrated in FIG. 5. Specifically, the illustrative embodiment of the SEF file 410 shown in FIG. 5 comprises data and metadata relating to several nucleotide sequences that contain a SNP. The SEF file 410 of FIG. 5 conforms to the file format specification outlined in Table 2, which references the FASTQ file format specification (the third column denoting several distinguishing features of this SEF file 410, as compared to FASTQ files 404). It will be appreciated that, in other embodiments, SEF files 410 may be formatted according to additional or different specifications than those set forth in Table 2.

TABLE 2

Line Type: 1) Title and Description
    FASTQ:
        First character must be “@”
        Free format field with no length limitation
        Arbitrary content can be included
    SEF (Differences from FASTQ):
        The systematic identifier produced by MPS instrument/software may comprise the initial portion of this field
        Forensic metadata as a series of optional attribute-value pairs

Line Type: 2) Sequence
    FASTQ:
        No specific initial character is required for this line type
        Any printable character is permitted
        Can be line wrapped
    SEF (Differences from FASTQ):
        Only International Union of Pure and Applied Chemistry (IUPAC) nucleotide base codes are permitted (“ACTGNURYSWKMBDHVN.-”)
        Upper case, lower case, and mixed case accepted

Line Type: 3) End of Sequence
    FASTQ:
        First character must be “+”
        If the title line is repeated, it must be identical
    SEF (Differences from FASTQ):
        Content after “+” does not have to match the title line
        Human-readable allele designation for SNP or STR alleles as a series of optional attribute-value pairs

Line Type: 4) Quality Scores
    FASTQ:
        Accepts printable ASCII characters 33-126
        Can be line wrapped
        Length must be equal to sequence length
    SEF (Differences from FASTQ):
        Averaged PHRED scores with an ASCII offset of 33
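As an illustration of the quality score line described in Table 2, the following Python sketch decodes PHRED scores using an ASCII offset of 33 and forms a per-position average across several quality lines of equal length. The disclosure does not specify how the averaged scores are computed; the per-position arithmetic mean, the rounding to the nearest integer, and the function names are assumptions made only for this example.

```python
def decode_phred33(quality_line):
    """Convert a quality line to PHRED scores (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in quality_line]


def encode_phred33(scores):
    """Convert integer PHRED scores back to printable ASCII (score plus 33)."""
    return "".join(chr(score + 33) for score in scores)


def average_quality_lines(quality_lines):
    """Average PHRED scores position by position (assumed scheme).

    All lines must have the same length, mirroring the requirement that the
    quality line length equal the sequence length.
    """
    decoded = [decode_phred33(line) for line in quality_lines]
    if len({len(scores) for scores in decoded}) != 1:
        raise ValueError("quality lines must be non-empty and of equal length")
    # zip(*decoded) groups the scores found at each position across reads
    means = [round(sum(column) / len(column)) for column in zip(*decoded)]
    return encode_phred33(means)
```

For instance, the character “I” (ASCII 73) decodes to a PHRED score of 40 and “5” (ASCII 53) decodes to 20, so averaging the two at one position gives 30, which is written back as “?” (ASCII 63).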

Similar to FASTQ files, the SEF file 410 of FIG. 5 comprises a number of records, where each record includes data in four text lines. The second text line of each record contains a text string representing a nucleotide sequence containing a SNP. In some embodiments, this text string in the second line of each record of the SEF file 410 may be an output of the allelotyping method 100 used to identify the SNP alleles. In the illustrative embodiment of FIG. 5, the SNP nucleotide is printed as a lower case character flanked by upper case characters for the remainder of the nucleotide sequence. The fourth text line of each record may contain average quality scores for the nucleotide sequences represented by the text string of the second text line. The first and third text lines of each record may contain metadata related to case management and to evidence preservation.
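A minimal Python sketch of reading such four-line records, and of locating the lower case SNP character in the second text line, is given below. It assumes that no line is wrapped, that the first line begins with “@” and the third with “+” as in FASTQ, and that exactly one character of the sequence is lower case; the function names and the example record are hypothetical.

```python
def read_records(lines):
    """Yield one dict per four-line record (title, sequence, designation, quality).

    Assumes unwrapped lines; blank lines are ignored.
    """
    kept = [line.rstrip("\n") for line in lines if line.strip()]
    if len(kept) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(kept), 4):
        title, sequence, separator, quality = kept[i:i + 4]
        if not title.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at non-empty line {i + 1}")
        yield {
            "title": title[1:],            # text after "@"
            "sequence": sequence,
            "designation": separator[1:],  # text after "+"
            "quality": quality,
        }


def snp_position(sequence):
    """Return the zero-based index of the single lower case (SNP) character."""
    lower = [i for i, ch in enumerate(sequence) if ch.islower()]
    if len(lower) != 1:
        raise ValueError("expected exactly one lower case character")
    return lower[0]


# Hypothetical record, not taken from FIG. 5
example = ["@read1\n", "ACGTaCGT\n", "+\n", "IIIIIIII\n"]
record = next(read_records(example))
position = snp_position(record["sequence"])  # index of the "a"
```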

In some embodiments, the forensic metadata included in the first text line of each record of the SEF file 410 may be in the form of one or more attribute-value pairs 500. In the illustrative embodiment of FIG. 5, the first text line of each record contains several attribute-value pairs 500. In some embodiments, the attribute-value pairs 500 may specify one or more of a file format, a unique case identifier, a unique sample identifier, a unique laboratory identifier, and a unique technician identifier associated with the SEF file 410. Several illustrative attribute-value pairs 500 are set forth in Table 3 below, including possible permissible values and intended purposes. The first text line of each record of the SEF file 410 may also include a unique sequence identifier 502 created by the MPS instrument 402.

TABLE 3

Attribute    Permissible Values        Purpose
file         X: = [0-9]; Y: = [0-9]    X refers to major releases and Y to minor releases
caseID       [A-Za-z0-9_.:-]           Unique case identity numbers for case management
sampleID     [A-Za-z0-9_.:-]           Unique sample identity numbers for case management
labID        [A-Za-z0-9_.:-]           Unique laboratory identity numbers for case management
techID       [A-Za-z0-9_.:-]           Unique technician identity numbers for case management
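The permissible values of Table 3 lend themselves to a simple check. The Python sketch below is offered under stated assumptions: the disclosure does not fix how attribute-value pairs are delimited, so the `name=value` form separated by whitespace is hypothetical, as are the function names and the example title line. Reading the final character of the permissible character class as a hyphen is likewise an assumption.

```python
import re

# Character class for the identifier attributes of Table 3 (assumed reading)
ID_PATTERN = re.compile(r"[A-Za-z0-9_.:-]+")
ID_ATTRIBUTES = ("caseID", "sampleID", "labID", "techID")


def parse_title_line(title):
    """Split a first text line into (sequence identifier, attribute dict).

    Assumes whitespace-separated fields, with attribute-value pairs written
    as name=value and the first field without "=" being the sequence
    identifier created by the instrument.
    """
    sequence_id = None
    pairs = {}
    for field in title.lstrip("@").split():
        if "=" in field:
            name, value = field.split("=", 1)
            pairs[name] = value
        elif sequence_id is None:
            sequence_id = field
    return sequence_id, pairs


def invalid_ids(pairs):
    """Return names of identifier attributes whose values fall outside the class."""
    return [
        name for name in ID_ATTRIBUTES
        if name in pairs and not ID_PATTERN.fullmatch(pairs[name])
    ]


# Hypothetical title line; "T!9" contains a character outside the class
seq_id, attrs = parse_title_line(
    "@M01:1:A caseID=C-001 sampleID=S_7 labID=LAB.2 techID=T!9"
)
problems = invalid_ids(attrs)  # only techID is reported
```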

In the illustrative embodiment of FIG. 5, the third text line of each record of the SEF file 410 comprises a human-readable allele designation for the SNP in the form of a number of attribute-value pairs 504, 506. Several illustrative attribute-value pairs 504, 506 that may be used as part of the human-readable allele designation for the SNP are set forth in Table 4 below, including possible permissible values and intended purposes. The human-readable allele designation for the SNP of each record strikes a balance between human and computer readability. One attribute-value pair 504 of the human-readable allele designation for the SNP may specify a reference SNP identifier, as reported in the National Center for Biotechnology Information (NCBI) SNP database. Where no reference SNP identifier is available, this attribute-value pair 504 may specify “unknown.” Additionally, a series of attribute-value pairs 506 of the human-readable allele designation for the SNP may specify the state of the allele. For instance, the human-readable allele designation may include attribute-value pairs 506 that specify a first SNP state, a second SNP state, an abundance count of the first SNP state, an abundance count of the second SNP state, and a strand of the nucleotide sequence represented in the second text line. In the illustrative embodiment, the series of attribute-value pairs 506 that specify the state of the allele follows the conventions described in Illumina, Inc., “Technical Note: ‘TOP/BOT’ Strand and ‘A/B’ Allele,” Pub. No. 370-2006-018 (Jun. 27, 2006), the entire disclosure of which is incorporated herein by reference.

TABLE 4

Attribute         Permissible Values          Purpose
SNP               [A-Za-z0-9]                 Reference SNP identifier, as reported in NCBI SNP database (can be “unknown”)
allele_A          [G, A, T, C, g, a, t, c]    SNP allele state (following Illumina A/B convention)
allele_B          [G, A, T, C, g, a, t, c]    SNP allele state (following Illumina A/B convention)
Strand            [A-Za-z]                    Strand where SNP is found (following Illumina TOP/BOT convention)
allele_A_count    [0-9]                       Abundance count for allele_A
allele_B_count    [0-9]                       Abundance count for allele_B
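One way a third text line might be assembled from the attributes of Table 4 is sketched below in Python. The `name=value` syntax, the whitespace separator, the function name `format_designation`, and the example values are hypothetical; the patterns treat each permissible character class of Table 4 as applying to every character of the value (one character for the allele states), which is an interpretation rather than a statement of the format.

```python
import re

# Patterns derived from the permissible values listed in Table 4
TABLE_4_PATTERNS = {
    "SNP": r"[A-Za-z0-9]+",
    "allele_A": r"[GATCgatc]",
    "allele_B": r"[GATCgatc]",
    "Strand": r"[A-Za-z]+",
    "allele_A_count": r"[0-9]+",
    "allele_B_count": r"[0-9]+",
}


def format_designation(values):
    """Build a third text line ("+" followed by name=value pairs).

    Raises KeyError for an attribute not in Table 4 and ValueError for a
    value outside the permissible characters.
    """
    fields = []
    for name, value in values.items():
        pattern = TABLE_4_PATTERNS[name]  # KeyError if the name is unknown
        text = str(value)
        if not re.fullmatch(pattern, text):
            raise ValueError(f"value {text!r} is not permitted for {name}")
        fields.append(f"{name}={text}")
    return "+" + " ".join(fields)


# Hypothetical designation for a SNP with no reference identifier
line = format_designation({
    "SNP": "unknown",
    "allele_A": "a",
    "allele_B": "G",
    "Strand": "TOP",
    "allele_A_count": 120,
    "allele_B_count": 95,
})
```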

While certain illustrative embodiments have been described in detail in the figures and the foregoing description, such an illustration and description is to be considered as exemplary and not restrictive in character, it being understood that only illustrative embodiments have been shown and described and that all changes and modifications that come within the spirit of the disclosure are desired to be protected. There are a plurality of advantages of the present disclosure arising from the various features of the methods, systems, and articles described herein. It will be noted that alternative embodiments of the methods, systems, and articles of the present disclosure may not include all of the features described yet still benefit from at least some of the advantages of such features. Those of ordinary skill in the art may readily devise their own implementations of the methods, systems, and articles that incorporate one or more of the features of the present disclosure.

1. An allelotyping method comprising: selecting a plurality of text strings that each represent a nucleotide sequence that was read by a massively parallel sequencing (MPS) instrument, wherein the nucleotide sequences represented by the selected plurality of text strings each correspond to a particular locus; comparing the selected plurality of text strings to one another to determine an abundance count for each unique text string included in the selected plurality of text strings; and determining one or more alleles for the particular locus by comparing the abundance count for each unique text string included in the selected plurality of text strings to an abundance threshold.
2. The allelotyping method of claim 1, wherein determining the one or more alleles for the particular locus comprises identifying one or more unique text strings that each represent a nucleotide sequence containing a short tandem repeat (STR).
3. The allelotyping method of claim 1, wherein determining the one or more alleles for the particular locus comprises identifying one or more unique text strings that each represent a nucleotide sequence containing a single nucleotide polymorphism (SNP).
4. The allelotyping method of claim 1, wherein comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold comprises identifying the unique text string having a highest abundance count from among the selected plurality of text strings.
5. The allelotyping method of claim 4, wherein comparing the abundance count for each unique text string included in the selected plurality of text strings to the abundance threshold further comprises calculating whether a ratio of the abundance count for each unique text string compared to the highest abundance count exceeds the abundance threshold.
6. The allelotyping method of claim 5, wherein the abundance threshold is a percentage value in the range of 15% to 60%.
7. The allelotyping method of claim 5, wherein the abundance threshold is a percentage value configurable by a user.
8. The allelotyping method of claim 5, further comprising: receiving a first text-based computer file comprising a plurality of text strings that each represent a nucleotide sequence that was read by the MPS instrument, prior to selecting the plurality of text strings for which the represented nucleotide sequences each correspond to the particular locus; and generating a second text-based computer file comprising each unique text string for which the ratio exceeds the abundance threshold, wherein a file size of the second text-based computer file is smaller than a file size of the first text-based computer file.
9. The allelotyping method of claim 8, wherein the second text-based computer file comprises one or more unique text strings that each represent a nucleotide sequence determined to be an allele for the particular locus.
10. The allelotyping method of claim 8, wherein the file size of the second text-based computer file is at least ten-thousand times smaller than the file size of the first text-based computer file.
11. The allelotyping method of claim 1, wherein the steps of (i) selecting the plurality of text strings for which the represented nucleotide sequences each correspond to a particular locus, (ii) comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings, and (iii) determining one or more alleles for the particular locus by comparing the abundance counts to the abundance threshold are performed for each of a plurality of loci present in a sample that was read by the MPS instrument.
12. The allelotyping method of claim 11, wherein selecting the plurality of text strings for which the represented nucleotide sequences each correspond to one of the plurality of loci comprises: determining whether each of a plurality of text strings generated by the MPS instrument when reading the sample includes text characters that represent a flanking nucleotide sequence associated with a particular locus; and selecting a plurality of text strings that include the text characters that represent the flanking nucleotide sequence associated with the particular locus.
13. The allelotyping method of claim 12, further comprising removing the text characters that represent the flanking nucleotide sequence from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings.
14. The allelotyping method of claim 1, further comprising removing all text characters that do not represent a short tandem repeat (STR) from each of the selected plurality of text strings prior to comparing the selected plurality of text strings to one another to determine the abundance count for each unique text string included in the selected plurality of text strings.
15. A computer-readable medium storing a text-based computer file, the text-based computer file comprising: one or more records, each of the one or more records including: a first text line; a second text line comprising a text string representing a nucleotide sequence containing a single nucleotide polymorphism (SNP); a third text line comprising a human-readable allele designation for the SNP; and a fourth text line.
16. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the human-readable allele designation of the third text line comprises a number of attribute-value pairs that specify a first SNP state, a second SNP state, an abundance count of the first SNP state, an abundance count of the second SNP state, and a strand of the nucleotide sequence represented in the second text line.
17. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the human-readable allele designation of the third text line further comprises an attribute-value pair specifying a reference SNP identifier associated with the nucleotide sequence represented in the second text line.
18. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the first text line comprises a unique sequence identifier created by a massively parallel sequencing (MPS) instrument when generating data related to the nucleotide sequence represented in the second text line.
19. The computer-readable medium of claim 18, wherein, for each of the one or more records of the text-based file, the first text line further comprises forensic metadata specifying one or more of a file format, a unique case identifier, a unique sample identifier, a unique laboratory identifier, and a unique technician identifier.
20. The computer-readable medium of claim 15, wherein, for each of the one or more records of the text-based file, the fourth text line comprises a text string representing quality scores associated with the nucleotide sequence represented in the second text line.